1 Background

We hope to explore the relative influence of physical traits, environmental conditions, and species identity on the growth rate of trees. We have two measures of growth rate - basal area and biomass. A gradient boosted model is a good candidate for this work, for the reasons outlined below.

1.1 About Gradient Boosted Models

A gradient boosted machine (GBM) is a machine learning model that fits the data with an ensemble of decision trees.

A decision tree starts with all of the observations and then, from the variables provided, finds the variable split that produces the "purest" groupings of the data. In this case, it would try to place rows with higher growth rates in one node and those with lower growth rates in another.
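The split search described above can be sketched in base R. This is an illustrative toy (the function name and data are made up), not the algorithm gbm or H2O actually uses:

```r
# For each candidate threshold, measure how much splitting there reduces
# the spread (sum of squared errors) of the response within the two nodes,
# and return the threshold with the largest reduction.
best_split <- function(x, y) {
  sse <- function(v) sum((v - mean(v))^2)
  candidates <- sort(unique(x))[-1]    # thresholds between observed values
  gains <- sapply(candidates, function(t) {
    sse(y) - (sse(y[x < t]) + sse(y[x >= t]))
  })
  candidates[which.max(gains)]
}

# toy data: growth rate jumps between young and old trees
best_split(x = c(1, 2, 3, 10, 11, 12),
           y = c(0.1, 0.2, 0.1, 0.9, 1.0, 0.8))  # -> 10
```

The split at 10 separates the low-growth rows from the high-growth rows, which is exactly the "purest grouping" criterion described above.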

GBMs are an ensemble of decision trees, but the trees are fit sequentially. We call GBMs an ensemble of weak learners because each subsequent tree attempts to correct the errors of the trees before it. Thus, while one tree by itself cannot describe the relationships, the full ensemble can. Below is a figure by Bradley Boehmke that illustrates how each subsequent tree improves the fit.

Figure: boosted regression decision stumps as 0-1024 successive trees are added.
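The sequential correction can also be sketched. Each "tree" below is reduced to a single median split (a stump) fit to the residuals of the ensemble so far; this is a deliberately minimal base-R illustration, not how gbm or H2O grow their trees:

```r
boost <- function(x, y, n_trees = 200, learn_rate = 0.1) {
  pred <- rep(mean(y), length(x))  # tree 0: predict the overall mean
  for (i in seq_len(n_trees)) {
    resid <- y - pred              # errors of the current ensemble
    split <- median(x)             # a stump: one split, two constants
    step  <- ifelse(x < split,
                    mean(resid[x < split]),
                    mean(resid[x >= split]))
    pred  <- pred + learn_rate * step  # shrink each tree's correction
  }
  pred
}

x <- 1:20
y <- ifelse(x <= 10, 1, 3)  # a step the single mean cannot capture
p <- boost(x, y)
mean((y - p)^2)  # orders of magnitude below the error of the mean alone
```

No single shrunken stump fits the step, but 200 of them, each nudging the prediction toward what the previous trees missed, recover it almost exactly.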

1.2 Changes made to raw data

Below is a list of the changes made to the raw data:

  • Environmental variables - these are the principal components (1-5) from a PCA that was fit on all the environmental variables. We visualized the PCA and used the eigenvectors to determine which environmental condition best explained each PC: Soil Fertility, Light, Temperature, pH, Soil.Humidity.Depth, and Slope.
  • Remove the influence of sampling date - some of the variables vary with when they were measured. Assuming a linear relationship between sample date and each variable, we replaced the variable with the residuals of a linear model (Variable ~ Sample date) whenever the slope p-value was < 0.05. This changed the following variables: Estem, Branching Distance, Stem Wood Density, LMA, LNC, d15N, Ks, X.Sapwood, d13C, Soil.Fertility, Light, Temperature, pH
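The detrending rule in the second bullet can be sketched as follows; `detrend_by_date` is a hypothetical helper (not a function from the analysis code), and the 0.05 cutoff is the one stated above:

```r
# Keep a variable's residuals from a linear fit on sampling date when
# the date effect is significant (slope p-value < 0.05); otherwise
# return the variable unchanged.
detrend_by_date <- function(value, sample_date, alpha = 0.05) {
  fit <- lm(value ~ sample_date)
  p <- summary(fit)$coefficients["sample_date", "Pr(>|t|)"]
  if (p < alpha) residuals(fit) else value
}

set.seed(1)
date  <- 1:100
trait <- 2 + 0.05 * date + rnorm(100, sd = 0.5)  # drifts with sampling date
out   <- detrend_by_date(trait, date)
cor(out, date)  # essentially zero after detrending
```

Variables whose date slope is not significant pass through unchanged, so only the thirteen variables listed above were altered.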

2 Data Preparation

First, we import the data and the labels needed for plotting later.

# a prior error suggests that single quotation marks are used to group digits,
# so we read them as a thousands separator
rgr_msh_residuals_julian_df <- read_csv(here("data/rgr_msh_residuals_julian_df.csv"),
                        guess_max = 10000,
                        col_types = cols(),
                        locale = locale(grouping_mark = "'"))
# these are the labels for the variables - needed for plotting
labels_rgr_msh <- read_csv(here("data/labels.csv"),
                        guess_max = 10000,
                        col_types = cols())
# this helps in listing the variables we actually need to keep, depending on the analyses
PlantTraitsKeep <- labels_rgr_msh %>% filter(Class == 1) %>% pull(Feature)
EnvironmentalVariablesKeep <- c("Soil.Fertility", "Light", "Temperature",
                                "pH", "Soil.Humidity.Depth", "Slope")
OthersKeep <- labels_rgr_msh %>% filter(Class %in% c(3, 4, 5, 7)) %>% pull(Feature)
# put the ID columns first, then the remaining features
OthersKeep <- c("SampleID", "Site", "Species", "julian.date.2011",
                OthersKeep[!OthersKeep %in% c("SampleID", "Site", "Species", "julian.date.2011")])

Now, we can create the training and test sets; a 70:30 split is typical.

# need a training and test set to assess the performance of the model
# a 70:30 split is typical
# first, I will work with the imputed data
set.seed(634)
split_train_test <-
  initial_split(
    data = rgr_msh_residuals_julian_df, 
    prop = 0.70) 

rgr_msh_na <-
  split_train_test %>% training() %>% mutate(Group = "Train")%>%
  bind_rows(split_train_test %>% testing() %>% mutate(Group = "Test"))

3 Gradient Boosted Models

3.1 BIO ~ PlantTraits + TreeAge + EnvironmentalVariables

3.1.1 Tune Model

# tuning the model using H2O - we wrote a helper function for this
# takes ~5 minutes
startTime <- Sys.time()
gbmParmBIONoDate <-
  h20_gbm_tune(path = here("data/RGRH20NoDateBIO.csv"),
             status = "BIO_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.1.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_92"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 6
## 
## $min_rows
## [1] 1
## 
## $nbins
## [1] 128
## 
## $nbins_cats
## [1] 2048
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3433.695
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.54
## 
## $col_sample_rate
## [1] 0.39
## 
## $col_sample_rate_per_tree
## [1] 0.76
## 
## $min_split_improvement
## [1] 0
## 
## $histogram_type
## [1] "UniformAdaptive"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                 "Slope"             
##  [6] "Estem"              "Branching.Distance" "Stem.Wood.Density"  "Leaf.Area"          "LMA"               
## [11] "LCC"                "LNC"                "LPC"                "d15N"               "t.b2"              
## [16] "Ks"                 "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Tree.Age"          
## 
## $y
## [1] "BIO_GR"

3.1.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bio_residuals <-
  gbm(BIO_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BIO_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BIO_GR"))),
      n.trees = 1000,
      interaction.depth = 6, # max depth
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 17, # min_rows
      bag.fraction = 0.54, # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.1.4 Relative Importance

First, we look at the importance of variables in the model.
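As background for reading the importance plot: gbm's relative influence is, roughly, the total reduction in squared error attributable to each variable across all splits and trees, rescaled to sum to 100. A toy rescaling with made-up gain totals:

```r
# hypothetical raw error-reduction totals for three variables
gains <- c(Tree.Age = 4.2, LNC = 2.1, Light = 0.7)
rel_inf <- 100 * gains / sum(gains)  # rescale so importances sum to 100
rel_inf  # Tree.Age 60, LNC 30, Light 10
```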

3.1.5 Partial Dependence

Assessing, with everything else held constant, the relationship between growth rate and each predictor.
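A partial-dependence curve is computed by fixing the predictor of interest at each grid value for every row, predicting, and averaging. A base-R sketch with lm as a stand-in for the fitted GBM (the function and variable names here are illustrative):

```r
partial_dependence <- function(model, data, var, grid) {
  sapply(grid, function(g) {
    d <- data
    d[[var]] <- g                      # fix the predictor at this grid value
    mean(predict(model, newdata = d))  # average prediction over all rows
  })
}

set.seed(42)
d <- data.frame(age = 1:50, light = runif(50))
d$growth <- 0.02 * d$age + 0.5 * d$light + rnorm(50, sd = 0.01)
m <- lm(growth ~ age + light, data = d)
pd <- partial_dependence(m, d, "age", grid = c(10, 20, 30))
diff(pd)  # roughly constant steps of ~0.2, as expected for a linear model
```

For the GBM, the same averaging is done over its tree ensemble, which is why the curves below can be nonlinear.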

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta13C

Tree Age

3.1.6 Performance - True Trait Values

How does the model perform when we use the true individual trait value?
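Performance on the held-out test rows is summarized with RMSE; a small helper (the function name is ours, not from the analysis code):

```r
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

rmse(c(1, 2, 3, 4), c(1.5, 2, 2.5, 4))  # sqrt of mean squared error 0.125
```

For the fitted model, `predicted` would come from `predict(gbm_regressor_bio_residuals, newdata = test_rows, n.trees = 1000)`, where `test_rows` (a hypothetical name) holds the rows with Group == "Test".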

3.1.7 Performance - Species Means for Trait Values

Does the test error change when we use the average for that species?
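The species-mean comparison replaces each individual's trait value with its species average before predicting; base R's `ave()` does the grouped mean (toy data below):

```r
d <- data.frame(Species = c("A", "A", "B", "B", "B"),
                LMA     = c(10, 14, 20, 22, 24))
d$LMA <- ave(d$LMA, d$Species)  # replace values with within-species means
d$LMA  # -> 12 12 22 22 22
```

Predicting on this modified frame and comparing the test error against the true-value run tells us how much within-species trait variation matters to the model.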

3.1.8 Interactions | Table

Let’s explore the interactions in these data.

3.1.9 Interactions | Plots

Now, we plot interactions with values > 0.1.

Leaf Nitrogen Concentration:Tree Age

Temperature:Stem Wood Density

Leaf Mass per Area:Delta13C

pH:Leaf Carbon Concentration

Leaf Mass per Area:Huber Value

Stem Wood Density:Leaf Nitrogen Concentration

Branching Distance:Modulus of Elasticity for Stem

Percent Lumen:Tree Age

Slope:Leaf Mass per Area

Leaf Mass per Area:Delta 15N

Slope:Modulus of Elasticity for Stem

Leaf Nitrogen Concentration:Leaf Phosphorus Concentration

Leaf Area:Leaf Mass per Area

Huber Value:Stem Wood Density

Branching Distance:Thickness to Span Ratio

Modulus of Elasticity for Stem:Thickness to Span Ratio

3.1.10 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.
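Group importance is just the per-variable relative influence summed within each group. A sketch with made-up numbers (the Feature-to-Group mapping below is illustrative):

```r
imp <- data.frame(
  Feature = c("Tree.Age", "LNC", "LMA", "Light", "pH"),
  rel.inf = c(40, 25, 15, 12, 8),
  Group   = c("Tree age", "Trait", "Trait", "Environment", "Environment")
)
# sum relative influence within each group
aggregate(rel.inf ~ Group, data = imp, FUN = sum)
```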

3.2 BAI ~ PlantTraits + TreeAge + EnvironmentalVariables

3.2.1 Tune Model

# tuning the model using H2O
startTime <- Sys.time()
gbmParmBAINoDate <-
  h20_gbm_tune(path = here("data/RGRH20NoDateBAI.csv"),
             status = "BAI_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.2.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_3"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 14
## 
## $min_rows
## [1] 2
## 
## $nbins
## [1] 512
## 
## $nbins_cats
## [1] 128
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3593.317
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.36
## 
## $col_sample_rate
## [1] 0.96
## 
## $col_sample_rate_per_tree
## [1] 0.2
## 
## $min_split_improvement
## [1] 1e-04
## 
## $histogram_type
## [1] "RoundRobin"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $x
##  [1] "Soil.Fertility"     "Light"              "Temperature"        "pH"                 "Slope"             
##  [6] "Estem"              "Branching.Distance" "Stem.Wood.Density"  "Leaf.Area"          "LMA"               
## [11] "LCC"                "LNC"                "LPC"                "d15N"               "t.b2"              
## [16] "Ks"                 "Ktwig"              "Huber.Value"        "X.Lum"              "VD"                
## [21] "X.Sapwood"          "d13C"               "Tree.Age"          
## 
## $y
## [1] "BAI_GR"

3.2.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bai_residuals <-
  gbm(BAI_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BAI_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, PlantTraitsKeep, "Tree.Age", "BAI_GR"))),
      n.trees = 1000,
      interaction.depth = 14, #max depth 
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 23, # min_rows
      bag.fraction = 0.36, # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.2.4 Relative Importance

First, we look at the importance of variables in the model.

3.2.5 Partial Dependence

Assessing, with everything else held constant, the relationship between growth rate and each predictor.

Soil Fertility

Light

Temperature

pH

Slope

Modulus of Elasticity for Stem

Branching Distance

Stem Wood Density

Leaf Area

Leaf Mass per Area

Leaf Carbon Concentration

Leaf Nitrogen Concentration

Leaf Phosphorus Concentration

Delta 15N

Thickness to Span Ratio

Conductivity per Sapwood Area

Conductivity per Branch

Huber Value

Percent Lumen

Vessel Diameter

Percent Sapwood

Delta13C

Tree Age

3.2.6 Performance - True Trait Values

How does the model perform when we use the true individual trait value?

3.2.7 Performance - Species Means for Trait Values

Does the test error change when we use the average for that species?

3.2.8 Interactions | Table

Let’s explore the interactions in these data.

3.2.9 Interactions | Plots

Now, we plot interactions with values > 1.01763794414045e-14 (effectively all nonzero interactions).

Temperature:Conductivity per Sapwood Area

Conductivity per Sapwood Area:Conductivity per Branch

Conductivity per Sapwood Area:Huber Value

Conductivity per Sapwood Area:Percent Lumen

Slope:Conductivity per Sapwood Area

Leaf Carbon Concentration:Conductivity per Sapwood Area

Temperature:Percent Lumen

3.2.10 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, plant traits, and environmental conditions.

3.3 BIO ~ SpeciesIdentity + TreeAge + EnvironmentalVariables

3.3.1 Tune Model

startTime <- Sys.time()
gbmParmBIONoDateSpecies <-
  h20_gbm_tune(path = here("data/RGRH20NoDateBIOSpecies.csv"),
             status = "BIO_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.3.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_24"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 11
## 
## $min_rows
## [1] 1
## 
## $nbins
## [1] 512
## 
## $nbins_cats
## [1] 2048
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3587.823
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.65
## 
## $col_sample_rate
## [1] 0.63
## 
## $col_sample_rate_per_tree
## [1] 0.83
## 
## $min_split_improvement
## [1] 1e-04
## 
## $histogram_type
## [1] "QuantilesGlobal"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $x
## [1] "Soil.Fertility" "Light"          "Temperature"    "pH"             "Slope"          "Species"       
## [7] "Tree.Age"      
## 
## $y
## [1] "BIO_GR"

3.3.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bio_residuals_species <-
  gbm(BIO_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BIO_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BIO_GR"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 11, # max depth
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 5, # min_rows
      bag.fraction = 0.65, # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.3.4 Relative Importance

First, we look at the importance of variables in the model.

3.3.5 Partial Dependence

Assessing, with everything else held constant, the relationship between growth rate and each predictor.

Soil Fertility

Light

Temperature

pH

Slope

Species

Tree Age

3.3.6 Performance

How does the model perform?

3.3.7 Interactions | Table

Let’s explore the interactions in these data.

InteractionsGBMBIOSpecies <-
  do.call(rbind, exploreinteractionsBIOSpecies) %>%
  filter(Target != Source) %>%     # drop self-interactions
  arrange(desc(Value)) %>%
  filter(row_number() %% 2 == 1)   # keep every other row: symmetric pairs (A:B, B:A) share a Value and sort adjacently

DT::datatable(InteractionsGBMBIOSpecies %>% arrange(desc(Value)))

3.3.8 Interactions | Plots

Now, we plot interactions with values > 0.15.

Species:Tree Age

Soil Fertility:Species

Temperature:Species

Temperature:Tree Age

Slope:Tree Age

Slope:Species

Light:Species

Light:Tree Age

pH:Species

pH:Tree Age

3.3.9 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, species identity, and environmental conditions.

3.4 BAI ~ SpeciesIdentity + TreeAge + EnvironmentalVariables

3.4.1 Tune Model

startTime <- Sys.time()
gbmParmBAINoDateSpecies <-
  h20_gbm_tune(path = here("data/RGRH20NoDateBAISpecies.csv"),
             status = "BAI_GR",
             type = "integer")
endTime <- Sys.time()
endTime-startTime

3.4.2 Model Parameters

First, we look at the best parameters from tuning.

## $model_id
## [1] "final_grid_model_86"
## 
## $training_frame
## [1] "train.hex"
## 
## $validation_frame
## [1] "valid.hex"
## 
## $score_tree_interval
## [1] 10
## 
## $ntrees
## [1] 10000
## 
## $max_depth
## [1] 9
## 
## $min_rows
## [1] 1
## 
## $nbins
## [1] 32
## 
## $nbins_cats
## [1] 64
## 
## $stopping_rounds
## [1] 5
## 
## $stopping_metric
## [1] "MSE"
## 
## $stopping_tolerance
## [1] 1e-04
## 
## $max_runtime_secs
## [1] 3530.433
## 
## $seed
## [1] 1234
## 
## $learn_rate
## [1] 0.05
## 
## $learn_rate_annealing
## [1] 0.99
## 
## $distribution
## [1] "gaussian"
## 
## $sample_rate
## [1] 0.7
## 
## $col_sample_rate
## [1] 0.81
## 
## $col_sample_rate_per_tree
## [1] 0.2
## 
## $min_split_improvement
## [1] 1e-04
## 
## $histogram_type
## [1] "RoundRobin"
## 
## $categorical_encoding
## [1] "Enum"
## 
## $x
## [1] "Soil.Fertility" "Light"          "Temperature"    "pH"             "Slope"          "Species"       
## [7] "Tree.Age"      
## 
## $y
## [1] "BAI_GR"

3.4.3 Build Model

Now, we can build the model.

set.seed(123)
gbm_regressor_bai_residuals_species <-
   gbm(BAI_GR ~ .,
      data = 
        rgr_msh_na %>% filter(Group == "Train")%>% filter(!is.na(BAI_GR))%>%
        select(any_of(c(EnvironmentalVariablesKeep, "Species", "Tree.Age", "BAI_GR"))) %>%
        mutate(Species = factor(Species)),
      n.trees = 1000,
      interaction.depth = 9, # max depth
      shrinkage = 0.05, #learning rate
      n.minobsinnode = 4, # min_rows
      bag.fraction = 0.81, # sample_rate
      verbose = FALSE,
      n.cores = NULL,
      cv.folds = 5)

3.4.4 Relative Importance

First, we look at the importance of variables in the model.

3.4.5 Partial Dependence

Assessing, with everything else held constant, the relationship between growth rate and each predictor.

Soil Fertility

Light

Temperature

pH

Slope

Species

Tree Age

3.4.6 Performance

How does the model perform?

3.4.7 Interactions | Table

Let’s explore the interactions in these data.

3.4.8 Interactions | Plots

Now, we plot interactions with values > 0.15.

Species:Tree Age

Temperature:Species

Soil Fertility:Species

Slope:Species

Soil Fertility:Slope

Temperature:Tree Age

Light:Species

Light:Slope

Soil Fertility:pH

3.4.9 Group Relative Importance

Finally, we compare the relative importance of the various groups - tree age, species identity, and environmental conditions.